Skip to content

fix(envd): fix data race in fan-out when subscriber is removed mid-iteration#2643

Merged
arkamar merged 5 commits into
mainfrom
fix/envd-subscriber-data-race
May 14, 2026
Merged

fix(envd): fix data race in fan-out when subscriber is removed mid-iteration#2643
arkamar merged 5 commits into
mainfrom
fix/envd-subscriber-data-race

Conversation

@arkamar

@arkamar arkamar commented May 13, 2026

Copy link
Copy Markdown
Member

run() snapshots the m.channels slice header under RLock and iterates after unlocking. remove() used append(channels[:i], channels[i+1:]...) which shifts the shared backing array in-place, corrupting the concurrent iteration — a subscriber can receive a value twice while another misses it entirely.

Fix remove() to allocate a new backing array via slices.Concat so the old slice that run() holds remains stable. This moves the allocation to the cold path (subscriber disconnect) instead of the hot path (every fan-out delivery).

Fixes: a67f983 ("fix(envd): fix fan-out deadlock when process subscriber disconnects (#2579)")

@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
2620 2 2618 5
View the full list of 6 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.87% (Passed 179 times, Failed 595 times)

Stack Traces | 39.1s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command dig in sandbox iq7ejxcod28u27tyy1ylo
--- FAIL: TestUpdateNetworkConfig (39.14s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.30% (Passed 173 times, Failed 589 times)

Stack Traces | 4.61s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ikt1ptaxmhyu15pg5y09l
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1366}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox ikt1ptaxmhyu15pg5y09l
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 14 May 2026 07:44:30 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ikt1ptaxmhyu15pg5y09l
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (4.61s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.90% (Passed 309 times, Failed 408 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 58.89% (Passed 171 times, Failed 245 times)

Stack Traces | 7.16s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox in77n4i388nt57zlq7z64
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1252}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.16s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.54% (Passed 179 times, Failed 356 times)

Stack Traces | 77.6s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (77.61s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.44% (Passed 169 times, Failed 350 times)

Stack Traces | 32.2s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1257}}
Executing command bash in sandbox ir68ncq68jn6y8391inux (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 190 MB\nFree memory before tmpfs mount: 795 MB\nMemory to use in integrity test (80% of free, min 64MB): 636 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"636+0 records in\n636+0 records out\n666894336 bytes (667 MB, 636 MiB) copied, 27.7543 s, 24.0 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=636\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 27.29\n\tPercent of CPU this job got: 97%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:27.90\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2728\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 273\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 835 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iweak4m3bijcza7f1sne2
Executing command bash in sandbox iweak4m3bijcza7f1sne2 (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1273}}
    sandbox_memory_integrity_test.go:75: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:75
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iweak4m3bijcza7f1sne2: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (32.21s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b9a572a6d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/envd/internal/services/process/handler/multiplex_test.go Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition in the MultiplexedChannel by creating a shallow copy of the subscribers slice before iteration, which prevents slice corruption during concurrent modifications. A regression test has been added to verify the fix and ensure stable delivery to subscribers. I have no further feedback.

Comment thread packages/envd/internal/services/process/handler/multiplex.go
@arkamar arkamar force-pushed the fix/envd-subscriber-data-race branch from 8b9a572 to 17fdd0a Compare May 13, 2026 15:05
Comment thread packages/envd/internal/services/process/handler/multiplex.go
Comment thread packages/envd/internal/services/process/handler/multiplex_test.go Outdated
@cursor

cursor Bot commented May 13, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches concurrency in the process event fan-out path; a mistake could cause dropped/duplicated events or panics under load, though the change is small and localized.

Overview
Fixes a race where remove() could mutate the shared backing array of m.channels while run() was iterating, leading to duplicated/missed deliveries; it now rebuilds the slice with slices.Concat to keep the snapshot stable, and bumps pkg.Version to 0.5.18.

Reviewed by Cursor Bugbot for commit a354a07. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/envd/internal/services/process/handler/multiplex_test.go Outdated
Comment thread packages/envd/internal/services/process/handler/multiplex_test.go Outdated
@arkamar arkamar force-pushed the fix/envd-subscriber-data-race branch from 40f5326 to 8756a4c Compare May 13, 2026 15:33

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8756a4c722e294965dc13379a942b911c06fe157. Configure here.

Comment thread packages/envd/internal/services/process/handler/multiplex_test.go Outdated
Comment thread packages/envd/internal/services/process/handler/multiplex.go Outdated
Comment thread packages/envd/internal/services/process/handler/multiplex.go
arkamar added 4 commits May 14, 2026 09:08
…eration

run() snapshots the m.channels slice header under RLock and iterates
after unlocking. remove() used append(channels[:i], channels[i+1:]...)
which shifts the shared backing array in-place, corrupting the
concurrent iteration — a subscriber can receive a value twice while
another misses it entirely.

Fix remove() to allocate a new backing array via slices.Concat so the
old slice that run() holds remains stable. This moves the allocation
to the cold path (subscriber disconnect) instead of the hot path
(every fan-out delivery).

Fixes: a67f983 ("fix(envd): fix fan-out deadlock when process subscriber disconnects (#2579)")
@arkamar arkamar force-pushed the fix/envd-subscriber-data-race branch from 340d4c1 to 5431989 Compare May 14, 2026 07:10
@arkamar arkamar requested a review from jakubno May 14, 2026 07:13
@arkamar arkamar enabled auto-merge (squash) May 14, 2026 07:35
@arkamar arkamar merged commit a2bb1cb into main May 14, 2026
53 checks passed
@arkamar arkamar deleted the fix/envd-subscriber-data-race branch May 14, 2026 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants